Download Latin.unicharset along with radical-stroke.txt #219

Shreeshrii · 2020-12-17T16:09:49Z

Need another PR to add Inherited.unicharset after tesseract-ocr/langdata_lstm#41 is merged

stweil · 2020-12-17T16:46:37Z

All unicharset files for scripts are potentially needed, starting with Arabic.unicharset and ending with Thai.unicharset.

I usually get the required ones to satisfy the error message(s), but still don't know what happens if they are missing.

Shreeshrii · 2020-12-18T07:15:15Z

I added only the Latin and Inherited unicharsets to this list because they are required in almost all cases, even though a missing one does not stop processing the way a missing radical-stroke.txt does.

We could add another optional variable for SCRIPT_UNICHARSET, downloading it when it is non-blank.
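The optional-variable idea could look roughly like the sketch below. It assumes SCRIPT_UNICHARSET holds a single script name such as Latin (how the variable would be wired into the Makefile is not specified here), and it reuses the langdata_lstm URL pattern from this PR's diff. Both function names are hypothetical.

```shell
#!/bin/sh
# Base URL taken from the wget line in this PR's diff.
LANGDATA_URL='https://github.com/tesseract-ocr/langdata_lstm/raw/master'

# script_unicharset_url SCRIPT
# Print the download URL for SCRIPT.unicharset.
# Return 1 and print nothing when SCRIPT is empty, so callers can skip the download.
script_unicharset_url() {
    [ -n "$1" ] || return 1
    printf '%s/%s.unicharset\n' "$LANGDATA_URL" "$1"
}

# fetch_script_unicharset SCRIPT DATA_DIR
# Download SCRIPT.unicharset into DATA_DIR, or do nothing when SCRIPT is empty.
fetch_script_unicharset() {
    url=$(script_unicharset_url "$1") || return 0
    wget -O "$2/$1.unicharset" "$url"
}
```

A call such as `fetch_script_unicharset "$SCRIPT_UNICHARSET" data` would then be a no-op for users who leave the variable unset.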

> still don't know what happens if they are missing.

I think some characters, e.g. Arabic accents, get dropped from the unicharset generated by unicharset_extractor. That was the reason I built the Inherited.unicharset.

stweil · 2021-01-14T19:29:53Z

Makefile

@@ -303,6 +303,8 @@ $(OUTPUT_DIR).traineddata: $(LAST_CHECKPOINT)
 endif

 $(DATA_DIR)/radical-stroke.txt:
+#	wget -O $(DATA_DIR)/Inherited.unicharset 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/Inherited.unicharset'
+	wget -O $(DATA_DIR)/Latin.unicharset 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/Latin.unicharset'


I'd put that in a separate Makefile target.

Shreeshrii · 2021-01-15

Inherited.unicharset is NOT in the langdata_lstm repo. I created it by copying the lines with Inherited from other unicharsets. But there are some differences in the coordinates for the same character in different unicharsets, so I am not sure which ones should be used.
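One way to see where the copied lines disagree is to compare the metrics field for each Inherited character across the source files. The sketch below is hypothetical: it assumes the character is field 1, the comma-separated metrics are field 3 and the script name is field 4 of each unicharset line, and it only reports conflicts rather than deciding which value to keep.

```shell
#!/bin/sh
# inherited_conflicts FILE...
# For lines whose script (field 4) is Inherited, remember the first metrics
# value (field 3) seen for each character (field 1) and report later lines
# whose metrics differ.
# Output columns: character, first metrics, differing metrics, file name.
inherited_conflicts() {
    awk '$4 == "Inherited" {
        if (!($1 in seen)) { seen[$1] = $3; next }
        if (seen[$1] != $3) print $1, seen[$1], $3, FILENAME
    }' "$@"
}
```

It could be run as `inherited_conflicts Arabic.unicharset Devanagari.unicharset`, with the file names standing in for whichever script unicharsets the lines were copied from.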

typeoo · 2021-05-23

Hi, how can I get the Inherited.unicharset?

stweil · 2021-01-14T19:37:44Z

A list of all required *.unicharset files can be extracted from unicharset:

sed 's/.*0,0,0.//' $(OUTPUT_DIR)/unicharset | sed 's/ .*//' | sort | uniq | grep "^[A-Z][a-z][a-z]*" | grep -v common
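The same list can be pulled out by column instead of by pattern. The sketch below assumes each character line in the unicharset has the script name as its fourth whitespace-separated field, directly after the comma-separated metrics field; that layout is an assumption about the file format, not something stated in this thread. The function name list_scripts is hypothetical.

```shell
#!/bin/sh
# list_scripts FILE
# Print each distinct script name found in a unicharset file, sorted.
# Assumes field 3 is the comma-separated metrics and field 4 is the script name.
# Lines without a comma in field 3 (such as the leading count line) are skipped.
list_scripts() {
    awk 'NF >= 4 && $3 ~ /,/ { print $4 }' "$1" | sort -u
}
```

Reading the column directly means the result does not depend on the metrics ending in 0,0,0. Names such as Common still appear in the output if the file contains them, so any filtering has to happen in a later step.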

Shreeshrii · 2021-01-15T18:10:17Z

Thanks for the suggestions @stweil and the hint to get the list of required unicharsets from $(OUTPUT_DIR)/unicharset.

I am having a hard time putting it together in a separate Makefile target using the list. I would appreciate it if you could make the required change.

Here is what I have tried so far:

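# Script names found in the generated unicharset, minus the entries that have no unicharset file.
SCRIPT_NAMES := $(shell sed 's/.*0,0,0.//' $(OUTPUT_DIR)/unicharset | sed 's/ .*//' | sort | uniq | grep "^[A-Z][a-z][a-z]*" | sed '/Common/d' | sed '/Inherited/d' | sed '/Joined/d')
# Targets must carry the $(DATA_DIR)/ prefix so that they match the pattern rule below.
SCRIPT_UNICHARSETS = $(foreach script,$(SCRIPT_NAMES),$(DATA_DIR)/$(script).unicharset)
.PHONY: scriptunicharsets
scriptunicharsets: $(SCRIPT_UNICHARSETS)
# No prerequisite: the file is downloaded, not built from a local file.
$(DATA_DIR)/%.unicharset:
	wget -O $@ 'https://github.com/tesseract-ocr/langdata/raw/master/$(notdir $@)'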

wrznr · 2021-01-22T08:21:40Z

@kba Could you please have a look at the change request and maybe come up with a proposal?

Shreeshrii · 2021-01-22T12:59:20Z

I added sed '/Common/d' | sed '/Inherited/d' | sed '/Joined/d' to the command suggested by @stweil because there are no unicharsets for Common and Inherited. Joined was being picked up accidentally.
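The same filtering can also be written as one exact-match step. A minimal sketch, assuming the script names arrive one per line on standard input; the function names downloadable_scripts and unicharset_paths are hypothetical. Matching whole lines would avoid dropping a script whose name merely contains one of the excluded words.

```shell
#!/bin/sh
# downloadable_scripts
# Read script names (one per line) on stdin and print every name except
# Common, Inherited and Joined, which are excluded in the comment above.
downloadable_scripts() {
    grep -vx -e 'Common' -e 'Inherited' -e 'Joined'
}

# unicharset_paths DATA_DIR
# Turn script names on stdin into DATA_DIR/<Script>.unicharset paths.
unicharset_paths() {
    while IFS= read -r script; do
        printf '%s/%s.unicharset\n' "$1" "$script"
    done
}
```

Chained together, `downloadable_scripts | unicharset_paths data` turns a raw list of script names into the list of files that would need to be downloaded.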

A simpler way may be to ask the user to specify a script and download that.

Shreeshrii · 2021-01-29T15:51:58Z

> A simpler way may be to ask the user to specify a script and download that.

I have tried that in the new Makefile-font2model. I think that is a much cleaner way of doing this.

Shreeshrii · 2021-02-02T16:25:00Z

Included as part of #230

Download Latin.traineddata along with radical-stroke.txt

465c5f6

kba approved these changes Dec 17, 2020

stweil requested changes Jan 14, 2021

Download script unicharsets

21e2340

Shreeshrii closed this Feb 2, 2021

Shreeshrii deleted the PR6 branch February 2, 2021 16:25
